Introduction
In this report, we will be doing a Principal Component analysis based
on Spotify data on popular songs.
A Principal Component Analysis is a technique to analyze large
datasets containing multiple dimensions/features. The purpose is to
enable visualization of multidimensional data.
PCA identifies the main axes of variance within a data set and allows
for easy data exploration to understand the key variables in the data
and spot outliers.
The goal is to determine which variables are related or not related
to each other.
library(tidyverse)
library(janitor)
Importing the data
We are using a dataset of popular songs on Spotify. In the previous
report on pre-processing, descriptive and bivariate statistics, we have
described this dataset and made several transformations.
Link
to original dataset
setwd("~/Documents/class/stats-final-project/")
Warning: The working directory was changed to /Users/sadeline/Documents/class/stats-final-project inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
dd <- read.csv("cleaneddata.csv")
Plotting the individuals on PC1 and PC2 axes
# PLOT OF INDIVIDUALS
#select your axis
#eje1<-2
eje1<-1
#eje2<-3
eje2<-2
plot(Psi[,eje1],Psi[,eje2])
text(Psi[,eje1],Psi[,eje2],labels=iden, cex=0.5)
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")

Plotting projection of variables, PC1 and PC2
We will create a loadings plot to see which variables most influence
PC1 and PC2.
#Projection of variables
Phi = cor(dcon,Psi)
Phi
PC1 PC2 PC3 PC4 PC5 PC6 PC7
popularity -0.24963532 0.08737373 0.06928732 -0.76486554 0.19582865 0.022250675 0.13691622
duration_ms -0.09862125 0.57086629 -0.24944613 0.08729430 0.41867774 -0.015558089 0.59669195
danceability -0.40179818 -0.59488199 -0.35184714 0.25033708 0.12847324 -0.109087549 0.29817715
energy -0.84361522 0.30812067 0.10770418 0.19022208 -0.09452873 -0.124664432 -0.12816114
loudness -0.86472495 0.08594352 0.05548540 -0.07341474 -0.10423578 -0.146350079 -0.09242656
speechiness -0.12184739 -0.19649985 0.58525297 0.45771434 0.08763029 0.193391954 0.20368192
acousticness 0.70188772 -0.44726772 0.12371687 -0.19446711 0.07742870 0.206281277 0.12109473
instrumentalness 0.52724351 0.33062244 -0.24448643 0.46281173 -0.05225130 0.021319812 -0.07477844
liveness -0.16028913 0.07616872 0.75382525 0.04841263 0.23748555 0.108454828 0.04399959
valence -0.54570179 -0.62066253 -0.14382045 0.03690987 -0.08294104 -0.007140834 0.12850532
tempo -0.36049724 0.18146229 -0.17543765 -0.04986269 -0.43193432 0.765021423 0.14563189
time_signature -0.30147566 -0.09267949 -0.27414163 0.08815102 0.68698749 0.380526932 -0.44354561
X<-Phi[,eje1]
Y<-Phi[,eje2]
#zooms
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(min(X,0),max(X,0)), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F)
axis(side=3, pos= 0, labels = F)
axis(side=2, pos= 0, labels = F)
axis(side=4, pos= 0, labels = F)
arrows(ze, ze, X, Y, length = 0.07,col="blue")
text(X,Y,labels=etiq,col="darkblue", cex=0.7)

Observations:
Based on the Loadings plot, we see that the variables that most
influence PC1 are instrumentalness, energy, loudness, and
acousticness.
The variables that most influence PC2 are duration.
From this plot we can also see that these variables may be positively
correlated: - Loudness and energy - Valence and danceability -
Instrumentalness and acousticness
These variables may be negatively correlated: - Loudness and
instrumentalness - Loudness and acousticness - Energy and acousticness -
Valence and instrumentalness - Tempo and acousticness
We can also try plotting the PC1 against PC3 to see the variables
that are not so clearly visible here.
Plotting projection of variables, PC1 and PC3
Since PC1 and PC2 only accounts for 38% of the variation, we should
also plot projection of variables in PC1 and PC3.
#Projection of variables
Phi = cor(dcon,Psi)
Phi
PC1 PC2 PC3 PC4 PC5 PC6 PC7
popularity -0.24963532 0.08737373 0.06928732 -0.76486554 0.19582865 0.022250675 0.13691622
duration_ms -0.09862125 0.57086629 -0.24944613 0.08729430 0.41867774 -0.015558089 0.59669195
danceability -0.40179818 -0.59488199 -0.35184714 0.25033708 0.12847324 -0.109087549 0.29817715
energy -0.84361522 0.30812067 0.10770418 0.19022208 -0.09452873 -0.124664432 -0.12816114
loudness -0.86472495 0.08594352 0.05548540 -0.07341474 -0.10423578 -0.146350079 -0.09242656
speechiness -0.12184739 -0.19649985 0.58525297 0.45771434 0.08763029 0.193391954 0.20368192
acousticness 0.70188772 -0.44726772 0.12371687 -0.19446711 0.07742870 0.206281277 0.12109473
instrumentalness 0.52724351 0.33062244 -0.24448643 0.46281173 -0.05225130 0.021319812 -0.07477844
liveness -0.16028913 0.07616872 0.75382525 0.04841263 0.23748555 0.108454828 0.04399959
valence -0.54570179 -0.62066253 -0.14382045 0.03690987 -0.08294104 -0.007140834 0.12850532
tempo -0.36049724 0.18146229 -0.17543765 -0.04986269 -0.43193432 0.765021423 0.14563189
time_signature -0.30147566 -0.09267949 -0.27414163 0.08815102 0.68698749 0.380526932 -0.44354561
eje3 <- 3
X<-Phi[,eje1]
Y<-Phi[,eje3]
#zooms
plot(Psi[,eje1],Psi[,eje3],type="n",xlim=c(min(X,0),max(X,0)), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F)
axis(side=3, pos= 0, labels = F)
axis(side=2, pos= 0, labels = F)
axis(side=4, pos= 0, labels = F)
arrows(ze, ze, X, Y, length = 0.07,col="blue")
text(X,Y,labels=etiq,col="darkblue", cex=0.7)

Based on the above Loadings plot, we see that the variables that most
influence PC3 are popularity, liveness and duration.
It also seems like Popularity and liveness are closely related, while
it may be negatively correlated with duration.
We can also see that speechiness is closer to danceability.
Finding Centroids
We can also find the centroids of modalities in categorical
variables, using the code below.
X<-Phi[,eje1]
Y<-Phi[,eje2]
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-3,1))
#plot(X,Y,type="none",xlim=c(min(X,0),max(X,0)))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-2,4), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(3, 6, 8, 16, 17)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)
#add ordinal qualitative variables. Ensure ordering is the correct
dordi<-c(18)
#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE, levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])
[1] "Larghissimo" "Grave" "Lento/Largo" "Larghetto" "Adagio" "Andante" "Moderato"
[8] "Allegro" "Vivace" "Presto" "Prestissimo"
c<-1
col<-length(dcat)+1
for(k in dordi){
seguentColor<-colors[col]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
#points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
#connect modalities of qualitative variables
lines(fdic1,fdic2,col="#000000")
text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
c<-c+1
col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)

That looks a bit crowded, so let’s try looking at the modalities one
by one.
Explicit
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(3)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)

NA
NA
Key
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(6)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)

NA
NA
Mode
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(8)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)

NA
NA
Multiple artists
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(17)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)

NA
NA
All except genre
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#nominal qualitative variables
dcat<-c(3,6,8,17)
colors<-rainbow(length(dcat))
c<-1
for(k in dcat){
seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)
#add ordinal qualitative variables. Ensure ordering is the correct
dordi<-c(18)
#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE, levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])
[1] "Larghissimo" "Grave" "Lento/Largo" "Larghetto" "Adagio" "Andante" "Moderato"
[8] "Allegro" "Vivace" "Presto" "Prestissimo"
c<-1
col<-length(dcat)+1
for(k in dordi){
seguentColor<-colors[col]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
#points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
#connect modalities of qualitative variables
lines(fdic1,fdic2,col="#000000")
text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
c<-c+1
col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)

NA
NA
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)
#add ordinal qualitative variables. Ensure ordering is the correct
dordi<-c(18)
#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE, levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])
[1] "Larghissimo" "Grave" "Lento/Largo" "Larghetto" "Adagio" "Andante" "Moderato"
[8] "Allegro" "Vivace" "Presto" "Prestissimo"
c<-1
col<-length(dcat)+1
for(k in dordi){
seguentColor<-colors[col]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean)
#points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
#connect modalities of qualitative variables
lines(fdic1,fdic2,col="#000000")
text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
c<-c+1
col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)

NA
NA
Observations from the plot of centroids above:
- Songs that are explicit are closely related to heavy metal and
grunge songs.
- Disney, jazz, and classical-tonk songs are probably high on the
instrumental scale, while new-age, ambient, sleep songs are high on the
acousticness scale.
- Seems like the slower songs (Grave, Lento, Larghetto, Adagio) are
also on the right side of the plot, which are more acoustic,
instrumental songs. Whereas the faster songs are on the left side.
Coloring the PCA plot using categorical variables
We can also plot all the individuals in PC1 and PC2 as axes, and
color-code it by categorical variables. We’ll examine one categorical
variable: Explicitness
# PROJECTION OF ILLUSTRATIVE qualitative variables on individuals' map
varcat=factor(dd[,3])
plot(Psi[,1],Psi[,2],col=c("grey", "red")[varcat])
axis(side=1, pos= 0, labels = F, col="darkgray")
axis(side=3, pos= 0, labels = F, col="darkgray")
axis(side=2, pos= 0, labels = F, col="darkgray")
axis(side=4, pos= 0, labels = F, col="darkgray")
legend("bottomleft",levels(varcat),pch=1,col=c("grey", "red"), cex=0.6)
# Overproject THE CDG OF LEVELS OF varcat
fdic1 = tapply(Psi[,1],varcat,mean)
fdic2 = tapply(Psi[,2],varcat,mean)
text(fdic1,fdic2,labels=levels(factor(varcat)),col="cyan", cex=0.75)

Although there are not many explicit songs, we can see that the
explicit songs tend to be on the left side of PC1.
# PROJECTION OF ILLUSTRATIVE qualitative variables on individuals' map
varcat=factor(dd[,18])
plot(Psi[,1],Psi[,2],col=rainbow(9)[varcat])
axis(side=1, pos= 0, labels = F, col="darkgray")
axis(side=3, pos= 0, labels = F, col="darkgray")
axis(side=2, pos= 0, labels = F, col="darkgray")
axis(side=4, pos= 0, labels = F, col="darkgray")
legend("bottomleft",levels(varcat),pch=1,col=rainbow(9), cex=0.6)
# Overproject THE CDG OF LEVELS OF varcat
fdic1 = tapply(Psi[,1],varcat,mean)
fdic2 = tapply(Psi[,2],varcat,mean)
text(fdic1,fdic2,labels=levels(factor(varcat)),col="cyan", cex=0.75)

From the tempo categories plot we see that the left side of PCA is
the faster songs, and the right side are the slower songs.
---
title: "Spotify data - Principal Component Analysis"
output:
  word_document: default
  html_notebook: default
---

## Introduction

In this report, we will be doing a Principal Component analysis based on Spotify data on popular songs.

A Principal Component Analysis is a technique to analyze large datasets containing multiple dimensions/features. The purpose is to enable visualization of multidimensional data.

PCA identifies the main axes of variance within a data set and allows for easy data exploration to understand the key variables in the data and spot outliers.

The goal is to determine which variables are related or not related to each other.

```{r, results=FALSE}
library(tidyverse)
library(janitor)
```

## Importing the data

We are using a dataset of popular songs on Spotify. In the previous report on pre-processing, descriptive and bivariate statistics, we have described this dataset and made several transformations.

[Link to original dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

```{r}
setwd("~/Documents/class/stats-final-project/")
dd <- read.csv("cleaneddata.csv")
```

## Performing the Principal Component Analysis

We use the `prcomp()` function.

This returns 3 things:

1. x => contains the principal components (PCs) for drawing a graph. We will be using the first two columns in x to draw a 2D plot that uses the first 2 PCs. The number of PCs is determined by the number of variables used.

2. sdev

3. rotation

Below is the code to run the prcomp function, and a basic plot of principal component 1 and 2.

```{r}
# determine which ones are numerical variables
numerical <- which(sapply(dd,is.numeric))
# print below to see if the numerical variables are correctly detected
# numerical

# saving the numerical observations to "dcon"
dcon <- dd[,numerical]
# print below to see if variables detected are indeed numerical
# sapply(dcon,class)

# Now we do a PRINCIPAL COMPONENT ANALYSIS on the numerical variables
pca <- prcomp(dcon, scale=TRUE, center = TRUE) # centering and scaling true
pc1 <- pca

plot(pca$x[,1], pca$x[,2])

```

### Scree plot

With the principal component analysis, we can also compute the Scree plot, which displays the variance accounted for by the components.

```{r}
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
pca.var.per
barplot(pca.var.per, main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")

#Cummulated Inertia in subspaces, from first principal component to the 11th dimension subspace
barplot(100*cumsum(pc1$sdev[1:dim(dcon)[2]]^2)/dim(dcon)[2])
percInerAccum<-100*cumsum(pc1$sdev[1:dim(dcon)[2]]^2)/dim(dcon)[2]
percInerAccum
```

From this we see that Principal component 1 accounts for only 25.2% of the variation. And in order to account for 80% of the variation, we need to include PC 1-7.

Next we will store the eigenvalues, eigenvectors, projections and include PC 1-7.

```{r, results=FALSE}
# SELECTION OF THE SINGIFICANT DIMENSIONS (keep 80% of total inertia)
nd = 7
pc1$rotation

# STORAGE OF THE EIGENVALUES, EIGENVECTORS AND PROJECTIONS IN THE nd DIMENSIONS
Psi = pc1$x[,1:nd]
Psi


# STORAGE OF LABELS FOR INDIVIDUALS AND VARIABLES
iden = row.names(dcon)
etiq = names(dcon) # getting names of numerical variables
ze = rep(0,length(etiq)) # WE WILL NEED THIS VECTOR AFTERWARDS FOR THE GRAPHICS
```

## Plotting the individuals on PC1 and PC2 axes

```{r}
# PLOT OF INDIVIDUALS

#select your axis
#eje1<-2
eje1<-1
#eje2<-3
eje2<-2

plot(Psi[,eje1],Psi[,eje2])
text(Psi[,eje1],Psi[,eje2],labels=iden, cex=0.5)
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
```

## Plotting projection of variables, PC1 and PC2

We will create a loadings plot to see which variables most influence PC1 and PC2.

```{r}
#Projection of variables

Phi = cor(dcon,Psi)
Phi

X<-Phi[,eje1]
Y<-Phi[,eje2]

#zooms
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(min(X,0),max(X,0)), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F)
axis(side=3, pos= 0, labels = F)
axis(side=2, pos= 0, labels = F)
axis(side=4, pos= 0, labels = F)
arrows(ze, ze, X, Y, length = 0.07,col="blue")
text(X,Y,labels=etiq,col="darkblue", cex=0.7)
```

### Observations:

Based on the  Loadings plot, we see that the variables that most influence PC1 are instrumentalness, energy, loudness, and acousticness.

The variables that most influence PC2 are duration.

From this plot we can also see that these variables may be positively correlated:
- Loudness and energy
- Valence and danceability
- Instrumentalness and acousticness

These variables may be negatively correlated:
- Loudness and instrumentalness
- Loudness and acousticness
- Energy and acousticness
- Valence and instrumentalness
- Tempo and acousticness

We can also try plotting the PC1 against PC3 to see the variables that are not so clearly visible here.

## Plotting projection of variables, PC1 and PC3

Since PC1 and PC2 only accounts for 38% of the variation, we should also plot projection of variables in PC1 and PC3.

```{r}
#Projection of variables

Phi = cor(dcon,Psi)
Phi

eje3 <- 3

X<-Phi[,eje1]
Y<-Phi[,eje3]

#zooms
plot(Psi[,eje1],Psi[,eje3],type="n",xlim=c(min(X,0),max(X,0)), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F)
axis(side=3, pos= 0, labels = F)
axis(side=2, pos= 0, labels = F)
axis(side=4, pos= 0, labels = F)
arrows(ze, ze, X, Y, length = 0.07,col="blue")
text(X,Y,labels=etiq,col="darkblue", cex=0.7)
```

Based on the above Loadings plot, we see that the variables that most influence PC3 are popularity, liveness and duration.

It also seems like Popularity and liveness are closely related, while it may be negatively correlated with duration.

We can also see that speechiness is closer to danceability.

## Finding Centroids

We can also find the centroids of modalities in categorical variables, using the code below.

```{r}

X<-Phi[,eje1]
Y<-Phi[,eje2]

plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-3,1))
#plot(X,Y,type="none",xlim=c(min(X,0),max(X,0)))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")

arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-2,4), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(3, 6, 8, 16, 17)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)

#add ordinal qualitative variables. Ensure ordering is the correct

dordi<-c(18)


#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE,  levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])

c<-1
col<-length(dcat)+1
for(k in dordi){
  seguentColor<-colors[col]
  fdic1 = tapply(Psi[,eje1],dd[,k],mean)
  fdic2 = tapply(Psi[,eje2],dd[,k],mean) 
  
  #points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
  #connect modalities of qualitative variables
  lines(fdic1,fdic2,col="#000000")
  text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
  c<-c+1
  col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)
```

That looks a bit crowded, so let's try looking at the modalities one by one.

### Explicit

```{r}
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(3)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)


```

### Key

```{r}
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(6)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)


```


## Mode

```{r}
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,1), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(8)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)


```
## Multiple artists

```{r}
#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(17)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)


```


All except genre
```{r}

#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-2,2))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)

#nominal qualitative variables

dcat<-c(3,6,8,17)
colors<-rainbow(length(dcat))

c<-1
for(k in dcat){
  seguentColor<-colors[c]
fdic1 = tapply(Psi[,eje1],dd[,k],mean)
fdic2 = tapply(Psi[,eje2],dd[,k],mean) 

text(fdic1,fdic2,labels=levels(factor(dd[,k])),col=seguentColor, cex=0.6)
c<-c+1
}
legend("bottomleft",names(dd)[dcat],pch=1,col=colors, cex=0.6)



#add ordinal qualitative variables. Ensure ordering is the correct

dordi<-c(18)


#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE,  levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])

c<-1
col<-length(dcat)+1
for(k in dordi){
  seguentColor<-colors[col]
  fdic1 = tapply(Psi[,eje1],dd[,k],mean)
  fdic2 = tapply(Psi[,eje2],dd[,k],mean) 
  
  #points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
  #connect modalities of qualitative variables
  lines(fdic1,fdic2,col="#000000")
  text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
  c<-c+1
  col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)


```

```{r}

#all qualitative together
plot(Psi[,eje1],Psi[,eje2],type="n",xlim=c(-1,3), ylim=c(-1,1))
axis(side=1, pos= 0, labels = F, col="cyan")
axis(side=3, pos= 0, labels = F, col="cyan")
axis(side=2, pos= 0, labels = F, col="cyan")
axis(side=4, pos= 0, labels = F, col="cyan")
arrows(ze, ze, X, Y, length = 0.07,col="lightgray")
text(X,Y,labels=etiq,col="gray", cex=0.7)


#add ordinal qualitative variables. Ensure ordering is the correct

dordi<-c(18)


#reorder modalities: when required
dd[,dordi[1]] <- factor(dd[,dordi[1]], ordered=TRUE,  levels= c('Larghissimo','Grave','Lento/Largo','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo'))
levels(dd[,dordi[1]])

c<-1
col<-length(dcat)+1
for(k in dordi){
  seguentColor<-colors[col]
  fdic1 = tapply(Psi[,eje1],dd[,k],mean)
  fdic2 = tapply(Psi[,eje2],dd[,k],mean) 
  
  #points(fdic1,fdic2,pch=16,col=seguentColor, labels=levels(dd[,k]))
  #connect modalities of qualitative variables
  lines(fdic1,fdic2,col="#000000")
  text(fdic1,fdic2,labels=levels(dd[,k]),col=seguentColor, cex=0.6)
  c<-c+1
  col<-col+1
}
legend("topleft",names(dd)[dordi],pch=19,col=colors[col:col+length(dordi)-1], cex=0.6)


```


Observations from the plot of centroids above:

1. Songs that are explicit are closely related to heavy metal and grunge songs.
2. Disney, jazz, and classical-tonk songs are probably high on the instrumental scale, while new-age, ambient, sleep songs are high on the acousticness scale.
3. Seems like the slower songs (Grave, Lento, Larghetto, Adagio) are also on the right side of the plot, which are more acoustic, instrumental songs. Whereas the faster songs are on the left side.

## Coloring the PCA plot using categorical variables

We can also plot all the individuals in PC1 and PC2 as axes, and color-code it by categorical variables.
We'll examine one categorical variable: Explicitness

```{r}

# PROJECTION OF ILLUSTRATIVE qualitative variables on individuals' map
varcat=factor(dd[,3])
plot(Psi[,1],Psi[,2],col=c("grey", "red")[varcat])
axis(side=1, pos= 0, labels = F, col="darkgray")
axis(side=3, pos= 0, labels = F, col="darkgray")
axis(side=2, pos= 0, labels = F, col="darkgray")
axis(side=4, pos= 0, labels = F, col="darkgray")
legend("bottomleft",levels(varcat),pch=1,col=c("grey", "red"), cex=0.6)


# Overproject THE CDG OF  LEVELS OF varcat
fdic1 = tapply(Psi[,1],varcat,mean)
fdic2 = tapply(Psi[,2],varcat,mean) 

text(fdic1,fdic2,labels=levels(factor(varcat)),col="cyan", cex=0.75)

```

Although there are not many explicit songs, we can see that the explicit songs tend to be on the left side of PC1.


```{r}

# PROJECTION OF ILLUSTRATIVE qualitative variables on individuals' map
varcat=factor(dd[,18])
plot(Psi[,1],Psi[,2],col=rainbow(9)[varcat])
axis(side=1, pos= 0, labels = F, col="darkgray")
axis(side=3, pos= 0, labels = F, col="darkgray")
axis(side=2, pos= 0, labels = F, col="darkgray")
axis(side=4, pos= 0, labels = F, col="darkgray")
legend("bottomleft",levels(varcat),pch=1,col=rainbow(9), cex=0.6)


# Overproject THE CDG OF  LEVELS OF varcat
fdic1 = tapply(Psi[,1],varcat,mean)
fdic2 = tapply(Psi[,2],varcat,mean) 

text(fdic1,fdic2,labels=levels(factor(varcat)),col="cyan", cex=0.75)

```

From the tempo categories plot we see that the left side of PCA is the faster songs, and the right side are the slower songs.
